1. Data Loading

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime as dt
In [2]:
price1999 = pd.read_csv('Data/resale-flat-prices-based-on-approval-date-1990-1999.csv')
price2012 = pd.read_csv('Data/resale-flat-prices-based-on-approval-date-2000-feb-2012.csv')
price2014 = pd.read_csv('Data/resale-flat-prices-based-on-registration-date-from-mar-2012-to-dec-2014.csv')
price2016 = pd.read_csv('Data/resale-flat-prices-based-on-registration-date-from-jan-2015-to-dec-2016.csv')
price2017 = pd.read_csv('Data/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv')
cpi = pd.read_csv('Data/CPI.csv')
In [3]:
price1999.head()
Out[3]:
month town flat_type block street_name storey_range floor_area_sqm flat_model lease_commence_date resale_price
0 1990-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 10 TO 12 31.0 IMPROVED 1977 9000
1 1990-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 04 TO 06 31.0 IMPROVED 1977 6000
2 1990-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 10 TO 12 31.0 IMPROVED 1977 8000
3 1990-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 07 TO 09 31.0 IMPROVED 1977 6000
4 1990-01 ANG MO KIO 3 ROOM 216 ANG MO KIO AVE 1 04 TO 06 73.0 NEW GENERATION 1976 47200
In [4]:
price2012.head()
Out[4]:
month town flat_type block street_name storey_range floor_area_sqm flat_model lease_commence_date resale_price
0 2000-01 ANG MO KIO 3 ROOM 170 ANG MO KIO AVE 4 07 TO 09 69.0 Improved 1986 147000.0
1 2000-01 ANG MO KIO 3 ROOM 174 ANG MO KIO AVE 4 04 TO 06 61.0 Improved 1986 144000.0
2 2000-01 ANG MO KIO 3 ROOM 216 ANG MO KIO AVE 1 07 TO 09 73.0 New Generation 1976 159000.0
3 2000-01 ANG MO KIO 3 ROOM 215 ANG MO KIO AVE 1 07 TO 09 73.0 New Generation 1976 167000.0
4 2000-01 ANG MO KIO 3 ROOM 218 ANG MO KIO AVE 1 07 TO 09 67.0 New Generation 1976 163000.0
In [5]:
price2014.head()
Out[5]:
month town flat_type block street_name storey_range floor_area_sqm flat_model lease_commence_date resale_price
0 2012-03 ANG MO KIO 2 ROOM 172 ANG MO KIO AVE 4 06 TO 10 45.0 Improved 1986 250000.0
1 2012-03 ANG MO KIO 2 ROOM 510 ANG MO KIO AVE 8 01 TO 05 44.0 Improved 1980 265000.0
2 2012-03 ANG MO KIO 3 ROOM 610 ANG MO KIO AVE 4 06 TO 10 68.0 New Generation 1980 315000.0
3 2012-03 ANG MO KIO 3 ROOM 474 ANG MO KIO AVE 10 01 TO 05 67.0 New Generation 1984 320000.0
4 2012-03 ANG MO KIO 3 ROOM 604 ANG MO KIO AVE 5 06 TO 10 67.0 New Generation 1980 321000.0
In [6]:
price2016.head()
Out[6]:
month town flat_type block street_name storey_range floor_area_sqm flat_model lease_commence_date remaining_lease resale_price
0 2015-01 ANG MO KIO 3 ROOM 174 ANG MO KIO AVE 4 07 TO 09 60.0 Improved 1986 70 255000.0
1 2015-01 ANG MO KIO 3 ROOM 541 ANG MO KIO AVE 10 01 TO 03 68.0 New Generation 1981 65 275000.0
2 2015-01 ANG MO KIO 3 ROOM 163 ANG MO KIO AVE 4 01 TO 03 69.0 New Generation 1980 64 285000.0
3 2015-01 ANG MO KIO 3 ROOM 446 ANG MO KIO AVE 10 01 TO 03 68.0 New Generation 1979 63 290000.0
4 2015-01 ANG MO KIO 3 ROOM 557 ANG MO KIO AVE 10 07 TO 09 68.0 New Generation 1980 64 290000.0
In [7]:
# Concatenate the five dataframes
prices = pd.concat([price1999, price2012, price2014], sort=False)
prices = pd.concat([prices, price2016, price2017], axis=0, ignore_index=True, sort=False)

prices['month'] = pd.to_datetime(prices['month']) # to datetime

prices.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 826581 entries, 0 to 826580
Data columns (total 11 columns):
month                  826581 non-null datetime64[ns]
town                   826581 non-null object
flat_type              826581 non-null object
block                  826581 non-null object
street_name            826581 non-null object
storey_range           826581 non-null object
floor_area_sqm         826581 non-null float64
flat_model             826581 non-null object
lease_commence_date    826581 non-null int64
resale_price           826581 non-null float64
remaining_lease        117527 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(7)
memory usage: 69.4+ MB
In [8]:
prices[~prices.isnull().any(axis=1)]['month'].dt.year.unique()
Out[8]:
array([2015, 2016, 2017, 2018, 2019, 2020], dtype=int64)

remaining_lease has many NAs; it is only available for sales from 2015 onwards.
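Since HDB flats are sold on 99-year leases, an approximate remaining lease could be back-filled for the older rows. A minimal sketch on toy rows (column names follow the dataset above; the 99-year assumption is ours, not computed in this notebook):

```python
import pandas as pd

# Sketch: derive an approximate remaining lease where it is missing,
# assuming the standard 99-year HDB lease. Toy rows only.
df = pd.DataFrame({
    'month': pd.to_datetime(['1990-01', '2016-06']),
    'lease_commence_date': [1977, 1980],
    'remaining_lease': [None, 63.0],
})
implied = 99 - (df['month'].dt.year - df['lease_commence_date'])
df['remaining_lease'] = df['remaining_lease'].fillna(implied)
print(df['remaining_lease'].tolist())  # [86.0, 63.0]
```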

2. Data Cleaning

In [9]:
# Clean flat type
prices['flat_type'] = prices['flat_type'].str.replace('MULTI-GENERATION', 'MULTI GENERATION')
prices['flat_type'].unique()
Out[9]:
array(['1 ROOM', '3 ROOM', '4 ROOM', '5 ROOM', '2 ROOM', 'EXECUTIVE',
       'MULTI GENERATION'], dtype=object)
In [10]:
# Rename flat model duplicates
replace_values = {'NEW GENERATION':'New Generation', 'SIMPLIFIED':'Simplified', 'STANDARD':'Standard', 'MODEL A-MAISONETTE':'Maisonette', 'MULTI GENERATION':'Multi Generation', 'IMPROVED-MAISONETTE':'Executive Maisonette', 'Improved-Maisonette':'Executive Maisonette', 'Premium Maisonette':'Executive Maisonette', '2-ROOM':'2-room', 'MODEL A':'Model A', 'MAISONETTE':'Maisonette', 'Model A-Maisonette':'Maisonette', 'IMPROVED':'Improved', 'TERRACE':'Terrace', 'PREMIUM APARTMENT':'Premium Apartment', 'Premium Apartment Loft':'Premium Apartment', 'APARTMENT':'Apartment', 'Type S1':'Type S1S2', 'Type S2':'Type S1S2'}

prices = prices.replace({'flat_model': replace_values})

prices['flat_model'].value_counts()
Out[10]:
Model A                 228389
Improved                217356
New Generation          177570
Simplified               53960
Standard                 39854
Premium Apartment        35066
Apartment                32004
Maisonette               28798
Model A2                  9109
DBSS                      1609
Adjoined flat             1085
Terrace                    642
Multi Generation           502
Type S1S2                  401
Executive Maisonette       196
2-room                      40
Name: flat_model, dtype: int64

Types of Flat Models:

Standard: (1/2/3/4/5-room). 1960s HDB flats, with the WC and shower in the same room. 5-room Standard flats were introduced in 1974.
Improved: (1/2/3/4/5-room). Introduced in 1966; the 3/4-room units had a separate WC and shower, and the blocks featured void decks. 5-room Improved flats were introduced in 1974.
New Generation: First built in 1975, New Generation flats come as 3-Room (67 / 82 sqm) or 4-Room (92 sqm), featuring an en-suite toilet for the master bedroom with a pedestal-type water closet, plus a store room.
Model A: Introduced in 1981: 3-Room (75 sqm), 4-Room (105 sqm), 5-Room (135 sqm), 5-Room Maisonette (139 sqm)
Model A2: Smaller units of Model A, e.g. 4-Room Model A2 (90 sqm)
Simplified: Introduced in 1984: 3-Room (64 sqm), 4-Room (84 sqm)
Multi Generation: 3Gen flats designed to meet the needs of multi-generation families.
Maisonette: AKA Model A Maisonette, a 2-storey HDB flat
Premium Apartment: Introduced sometime during the 1990s, featuring better-quality finishes; units come in ready-to-move-in condition, with flooring, kitchen cabinets and built-in wardrobes
Executive Maisonette: A more premium version of the Model A Maisonette. These units are no longer built, having been replaced by the Executive Condominium (EC) scheme in 1995
Executive Apartment: Executive Apartments / Maisonettes (146-150 sqm) were introduced in 1983 and replaced 5-Room Model A flats. In addition to the 3 bedrooms and separate living/dining areas found in 5A flats, EA and EM units feature a utility/maid room. 80% of Executive units were Maisonettes and 20% were Apartments.
DBSS: Public apartments built under the HDB's short-lived Design, Build and Sell Scheme (DBSS) from 2005 to 2012. A unique (and premium) breed of HDB flats built by private developers, commanding high prices. Quite similar to Executive Condominiums, except that a DBSS flat is like a premium HDB flat without the facilities of a private condo
Adjoined Flat: Large HDB flats combined from 2 HDB flats
Terrace: HDB terrace flats built before HDB's time, when Singapore's land constraints were not yet appreciated. Discontinued
Type S1S2: Apartments at The Pinnacle@Duxton, classified as "S" or Special apartments in view of the development's historical significance and award-winning design. For the application of HDB policies, S1 and S2 apartments are treated as 4-room and 5-room flats respectively
2-room: Most likely refers to the 2-room Flexi scheme, with 1 bedroom and 1 common area

In [11]:
prices['storey_range'].unique()
Out[11]:
array(['10 TO 12', '04 TO 06', '07 TO 09', '01 TO 03', '13 TO 15',
       '19 TO 21', '16 TO 18', '25 TO 27', '22 TO 24', '28 TO 30',
       '31 TO 33', '40 TO 42', '37 TO 39', '34 TO 36', '06 TO 10',
       '01 TO 05', '11 TO 15', '16 TO 20', '21 TO 25', '26 TO 30',
       '36 TO 40', '31 TO 35', '46 TO 48', '43 TO 45', '49 TO 51'],
      dtype=object)
In [12]:
prices['town'].unique()
Out[12]:
array(['ANG MO KIO', 'BEDOK', 'BISHAN', 'BUKIT BATOK', 'BUKIT MERAH',
       'BUKIT TIMAH', 'CENTRAL AREA', 'CHOA CHU KANG', 'CLEMENTI',
       'GEYLANG', 'HOUGANG', 'JURONG EAST', 'JURONG WEST',
       'KALLANG/WHAMPOA', 'MARINE PARADE', 'QUEENSTOWN', 'SENGKANG',
       'SERANGOON', 'TAMPINES', 'TOA PAYOH', 'WOODLANDS', 'YISHUN',
       'LIM CHU KANG', 'SEMBAWANG', 'BUKIT PANJANG', 'PASIR RIS',
       'PUNGGOL'], dtype=object)
In [13]:
plt.hist(prices['floor_area_sqm'], bins=50, edgecolor='black')
plt.title('Distribution of HDB Floor Area')
plt.show()
display(prices[prices['floor_area_sqm'] > 200]['flat_model'].value_counts())
Terrace                 61
Maisonette              14
Executive Maisonette     7
Apartment                4
Adjoined flat            1
Name: flat_model, dtype: int64

The floor area outliers mostly belong to special HDB models that are larger than the standard ones, so they might not be outliers in a multivariate sense.
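One way to make that concrete is to flag areas that are extreme within their own flat model rather than against a global cutoff. A minimal sketch on toy values (the 1.5 × IQR fence is a common convention, not something the notebook uses):

```python
import pandas as pd

# Sketch: per-group IQR fence instead of a global "> 200 sqm" cutoff.
# Toy values are illustrative only.
df = pd.DataFrame({
    'flat_model': ['Improved'] * 5 + ['Terrace'] * 5,
    'floor_area_sqm': [65, 67, 68, 70, 300, 230, 240, 250, 260, 270],
})

def iqr_outlier(s):
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    fence = 1.5 * (q3 - q1)
    return (s < q1 - fence) | (s > q3 + fence)

df['outlier'] = df.groupby('flat_model')['floor_area_sqm'].transform(iqr_outlier)
# Only the 300 sqm "Improved" flat is flagged; the large Terrace flats are not.
```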

In [14]:
bins = prices['lease_commence_date'].max() - prices['lease_commence_date'].min()
plt.hist(prices['lease_commence_date'], bins=bins, edgecolor='black')
plt.title('Distribution of Lease Commence Year')
plt.show()

2.1. Inflation Adjustment Using CPI

In [15]:
# Compute Resale Price Adjusted for Inflation Using Consumer Price Index for Housing & Utilities
# https://www.singstat.gov.sg/find-data/search-by-theme/economy/prices-and-price-indices/latest-data
cpi['month'] = pd.to_datetime(cpi['month'], format='%Y %b') # to datetime
prices = prices.merge(cpi, on='month', how='left') 
# https://people.duke.edu/~rnau/411infla.htm
prices['real_price'] = (prices['resale_price'] / prices['cpi']) * 100 
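The deflation step can be sanity-checked on toy numbers. This assumes the CPI series is rebased so the reference year (2019 here) equals 100; dividing by CPI and multiplying by 100 then expresses every price in reference-year dollars:

```python
import pandas as pd

# Sketch of the deflation step with toy numbers (the real series comes
# from Data/CPI.csv above; the 2019 = 100 base is an assumption).
toy = pd.DataFrame({'resale_price': [9000.0, 400000.0], 'cpi': [60.894, 100.0]})
toy['real_price'] = toy['resale_price'] / toy['cpi'] * 100
print(toy['real_price'].round(2).tolist())  # [14779.78, 400000.0]
```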
In [16]:
# Plot Median Resale Prices Over the Years

# Unadjusted
fig = plt.figure(figsize=(14,4.5))
fig.suptitle('Median HDB Resale Prices Over the Years', fontsize=18)
ax1 = fig.add_subplot(121)
prices.groupby('month')[['resale_price']].median().plot(ax=ax1, legend=None)
ax1.set_xlabel('Date'), ax1.set_ylabel('Resale Price in SGD ($)'), ax1.set_ylim(0, 500000), ax1.set_title('Unadjusted for Inflation', size=15)

# Adjusted
# https://jakevdp.github.io/PythonDataScienceHandbook/04.09-text-and-annotation.html
ax2 = fig.add_subplot(122)
prices.groupby('month')[['real_price']].median().plot(ax=ax2, legend=None)
ax2.set_xlabel('Date'), ax2.set_ylabel('Resale Price in SGD ($)'), ax2.set_ylim(0, 500000), ax2.set_title('Adjusted for Inflation to 2019 SGD',size=15)
ax2.annotate('1997 Asian Financial Crisis\nMedian: $403,766', xy=('1997-05-01',380000), xycoords='data', 
    bbox=dict(boxstyle="round4,pad=.5", fc="none", ec="gray"), xytext=(50,-140), textcoords='offset points', ha='center',
    arrowprops=dict(arrowstyle="->", connectionstyle="angle,angleA=0,angleB=90,rad=20"))
ax2.annotate('2013 Cooling Measures\nMedian: $401,887', xy=('2013-07-01',380000), xycoords='data', 
    bbox=dict(boxstyle="round4,pad=.5", fc="none", ec="gray"), xytext=(0,-90), textcoords='offset points', ha='center',
    arrowprops=dict(arrowstyle="->", connectionstyle="angle,angleA=0,angleB=90,rad=20"))  
plt.tight_layout(rect=[0, 0, 0.9, 0.9]) 
plt.show()
#prices.set_index('month').loc['1997']['real_price'].median()

Following the collapse of the Thai Baht in July 1997, housing prices in Singapore continued to fall and only started rising again gradually around 2004. In 2013, prices declined due to property market cooling measures such as the Additional Buyer's Stamp Duty (ABSD), Loan-to-Value (LTV) limits, and the Total Debt Servicing Ratio (TDSR).

In [17]:
# Convert remaining_lease to number of years
def getYears(text):
    if isinstance(text, str):
        yearmonth = [int(s) for s in text.split() if s.isdigit()]
        if len(yearmonth) > 1: # if there's year and month
            years = yearmonth[0] + (yearmonth[1]/12)
        else: # if only year
            years = yearmonth[0]
        return years
    else: # already numeric (or NaN)
        return text

prices['remaining_lease'] = prices['remaining_lease'].apply(getYears)
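A vectorized alternative to getYears() is sketched below, assuming the newer files store strings like "61 years 04 months" (the sample values here are hypothetical):

```python
import pandas as pd

# Sketch: vectorized parse of remaining_lease strings; plain numbers and
# missing values pass through unchanged. Sample values are hypothetical.
s = pd.Series(['61 years 04 months', '70 years', 70, None])
parts = s.astype(str).str.extract(r'(\d+)\s*years?(?:\s*(\d+)\s*months?)?')
years = (pd.to_numeric(parts[0], errors='coerce')
         + pd.to_numeric(parts[1], errors='coerce').fillna(0) / 12)
years = years.fillna(pd.to_numeric(s, errors='coerce'))  # keep numeric entries
```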
In [18]:
prices.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 826581 entries, 0 to 826580
Data columns (total 13 columns):
month                  826581 non-null datetime64[ns]
town                   826581 non-null object
flat_type              826581 non-null object
block                  826581 non-null object
street_name            826581 non-null object
storey_range           826581 non-null object
floor_area_sqm         826581 non-null float64
flat_model             826581 non-null object
lease_commence_date    826581 non-null int64
resale_price           826581 non-null float64
remaining_lease        117527 non-null float64
cpi                    826581 non-null float64
real_price             826581 non-null float64
dtypes: datetime64[ns](1), float64(5), int64(1), object(6)
memory usage: 88.3+ MB
In [19]:
bins = prices['remaining_lease'].max() - prices['remaining_lease'].min()
plt.hist(prices['remaining_lease'], bins=int(bins), edgecolor='black')
plt.title('Distribution of Remaining Lease for 2016-2020 Data')
plt.show()
In [20]:
prices.head()
Out[20]:
month town flat_type block street_name storey_range floor_area_sqm flat_model lease_commence_date resale_price remaining_lease cpi real_price
0 1990-01-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 10 TO 12 31.0 Improved 1977 9000.0 NaN 60.894 14779.781259
1 1990-01-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 04 TO 06 31.0 Improved 1977 6000.0 NaN 60.894 9853.187506
2 1990-01-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 10 TO 12 31.0 Improved 1977 8000.0 NaN 60.894 13137.583342
3 1990-01-01 ANG MO KIO 1 ROOM 309 ANG MO KIO AVE 1 07 TO 09 31.0 Improved 1977 6000.0 NaN 60.894 9853.187506
4 1990-01-01 ANG MO KIO 3 ROOM 216 ANG MO KIO AVE 1 04 TO 06 73.0 New Generation 1976 47200.0 NaN 60.894 77511.741715

3. Exploratory Data Analysis

3.1. By Flat Type

In [79]:
## Waffle chart for flat type - number of rooms
from pywaffle import  Waffle

flattype = dict(prices['flat_type'].value_counts()/len(prices)*100)
flattype1519 = dict(prices.set_index('month')['2015':'2019'].reset_index()['flat_type'].value_counts()/len(prices.set_index('month')['2015':'2019'].reset_index())*100)

plt.figure(figsize=(10,5),
    FigureClass=Waffle, 
    plots={
        '211': {
            'values': flattype,
            'legend': {'loc': 'upper left', 'bbox_to_anchor': (1.05, 1), 'fontsize':11},
            'title': {'label': 'Proportion of HDB Flat Types (All Years)', 'loc': 'left', 'fontsize':16}
        },
        '212': {
            'values': flattype1519,
            'legend': {'loc': 'upper left', 'bbox_to_anchor': (1.05, 1), 'fontsize':11},
            'title': {'label': 'Proportion of HDB Flat Types (2015-2019)', 'loc': 'left', 'fontsize':16}            
        },
    },
    rows=5, 
    colors=["#1f77b4", "#ff7f0e", "green", 'purple', 'black', 'yellow', 'brown'],
    icons='home', 
    font_size=22, 
    icon_legend=True)
    
plt.show()

There are not many 1 Room, 2 Room and Multi Generation flats, so they are excluded from the flat-type analysis below.

In [23]:
flattype = ['3 ROOM','4 ROOM','5 ROOM','EXECUTIVE']
prices1519 = prices.set_index('month').sort_index().loc['2015-01':'2019-12']
prices1519 = prices1519[prices1519['flat_type'].isin(flattype)][['flat_type','real_price']].reset_index()
prices1519['flat_type_year'] = prices1519['flat_type'] + ' - ' + prices1519['month'].apply(lambda x: str(x)[:4])
prices1519
Out[23]:
month flat_type real_price flat_type_year
0 2015-01-01 3 ROOM 227794.502559 3 ROOM - 2015
1 2015-01-01 3 ROOM 245660.738054 3 ROOM - 2015
2 2015-01-01 3 ROOM 254593.855802 3 ROOM - 2015
3 2015-01-01 3 ROOM 259060.414675 3 ROOM - 2015
4 2015-01-01 3 ROOM 259060.414675 3 ROOM - 2015
... ... ... ... ...
100056 2019-12-01 EXECUTIVE 577372.953064 EXECUTIVE - 2019
100057 2019-12-01 EXECUTIVE 562440.893933 EXECUTIVE - 2019
100058 2019-12-01 EXECUTIVE 630132.895326 EXECUTIVE - 2019
100059 2019-12-01 EXECUTIVE 785314.817580 EXECUTIVE - 2019
100060 2019-12-01 EXECUTIVE 615200.836195 EXECUTIVE - 2019

100061 rows × 4 columns

In [24]:
import joypy
fig, axes = joypy.joyplot(prices1519, by="flat_type_year", column="real_price",figsize=(9,7),
             linewidth=0.05,overlap=1.5,alpha=0.8,colormap=plt.cm.get_cmap('tab20',4))
axes[-1].set_xlim([0,1200000])
axes[-1].set_xticklabels(['0', '200k', '400k', '600k', '800k', '1000k', '1200k']) 
plt.xlabel('Resale Price SGD ($)', fontsize=14)
fig.show()

We don't see much change within each flat type over the last 5 years. The only consistent pattern is that prices rise with the number of rooms, which is unsurprising.

3.2. By Town

In [81]:
## 1997 vs 2019
# all room type
prices['year'] = pd.DatetimeIndex(prices['month']).year # extract out year
prices9719 = prices[prices['year'].isin([1997,2019])].groupby(['town','year'], as_index=False).agg({'real_price': 'median'})
prices9719['change'] = prices9719.groupby('town')['real_price'].apply(lambda x: x.pct_change()*100)
prices9719 = prices9719[prices9719['change'].notnull()] 
prices9719 = prices9719.sort_values('change', ascending=True).reset_index(drop=True).reset_index()
prices9719['color'] = prices9719['change'].apply(lambda x: '#00ba38' if x > 0 else '#f8766d')

# 4-room
prices9719_4room = prices[(prices['flat_type'].isin(['4 ROOM']) & prices['year'].isin([1997,2019]))].groupby(['town','year'], as_index=False).agg({'real_price': 'median'})
prices9719_4room['change'] = prices9719_4room.groupby('town')['real_price'].apply(lambda x: x.pct_change()*100)
prices9719_4room = prices9719_4room[prices9719_4room.change.notnull()]
prices9719_4room = prices9719_4room.sort_values('change', ascending=True).reset_index(drop=True).reset_index()
prices9719_4room['color'] = prices9719_4room['change'].apply(lambda x: '#00ba38' if x > 0 else '#f8766d')

## 2018 vs 2019
# all room type
prices1819 = prices[prices['year'].isin([2018,2019])].groupby(['town','year'], as_index=False).agg({'real_price': 'median'})
prices1819['change'] = prices1819.groupby('town')['real_price'].apply(lambda x: x.pct_change()*100)
prices1819 = prices1819[prices1819['change'].notnull()] 
prices1819 = prices1819.sort_values('change', ascending=True).reset_index(drop=True).reset_index()
prices1819['color'] = prices1819['change'].apply(lambda x: '#00ba38' if x > 0 else '#f8766d')

# 4-room
prices1819_4room = prices[(prices['flat_type'].isin(['4 ROOM']) & prices['year'].isin([2018,2019]))].groupby(['town','year'], as_index=False).agg({'real_price': 'median'})
prices1819_4room['change'] = prices1819_4room.groupby('town')['real_price'].apply(lambda x: x.pct_change()*100)
prices1819_4room = prices1819_4room[prices1819_4room.change.notnull()]
prices1819_4room = prices1819_4room.sort_values('change', ascending=True).reset_index(drop=True).reset_index()
prices1819_4room['color'] = prices1819_4room['change'].apply(lambda x: '#00ba38' if x > 0 else '#f8766d')
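The four near-identical blocks above could be factored into one helper; a minimal sketch with toy data standing in for `prices` (the helper name is ours; column names match the notebook):

```python
import pandas as pd

# Sketch: one helper for "median price change by town between two years",
# optionally restricted to certain flat types. Toy data only.
def town_change(df, years, flat_types=None):
    if flat_types is not None:
        df = df[df['flat_type'].isin(flat_types)]
    med = (df[df['year'].isin(years)]
           .groupby(['town', 'year'], as_index=False)['real_price'].median())
    med['change'] = med.groupby('town')['real_price'].pct_change() * 100
    return med.dropna(subset=['change']).sort_values('change').reset_index(drop=True)

toy = pd.DataFrame({
    'town': ['BEDOK'] * 2 + ['YISHUN'] * 2,
    'year': [1997, 2019] * 2,
    'flat_type': ['4 ROOM'] * 4,
    'real_price': [200000, 300000, 100000, 90000],
})
demo = town_change(toy, [1997, 2019])
print(demo['town'].tolist(), demo['change'].round(2).tolist())
# ['YISHUN', 'BEDOK'] [-10.0, 50.0]
```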
In [26]:
def loll_plot(df, x, y, subtitle, xlabel, xlim):
    plt.rc('axes', axisbelow=True)
    plt.grid(linestyle='--', alpha=0.4)
    plt.hlines(y=df.index, xmin=0, xmax=df[x], color=df.color, linewidth=1)
    plt.scatter(df[x], df.index, color=df.color, s=300)
    for i, txt in enumerate(df[x]):
        plt.annotate(str(round(txt)), (txt, i), color='white', fontsize=9, ha='center', va='center')
    plt.annotate(subtitle, xy=(1, 0), xycoords='axes fraction', fontsize=20,
                    xytext=(-5, 5), textcoords='offset points',
                    ha='right', va='bottom')
    plt.yticks(df.index, df[y]); plt.xticks(fontsize=12); plt.xlim(xlim)
    plt.xlabel(xlabel, fontsize=14)
In [27]:
fig = plt.figure(figsize=(12,7))

ax1 = plt.subplot(121)
loll_plot(prices9719, 'change', 'town', 'All Room Types', 'Percent Change (%)', [-40,125])

ax2 = plt.subplot(122)
loll_plot(prices9719_4room, 'change', 'town', '4-Room', 'Percent Change (%)', [-40,125])

fig.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.suptitle('1997 vs 2019, Median Price of Flats', fontsize=16)
plt.show()

Queenstown shows one of the largest increases in resale price, which could be due to its development over the past 2 decades.

In [28]:
fig = plt.figure(figsize=(12,7))

ax1 = plt.subplot(121)
loll_plot(prices1819, 'change', 'town', 'All Room Types', 'Percent Change (%)', [-30,20])

ax2 = plt.subplot(122)
loll_plot(prices1819_4room, 'change', 'town', '4-Room', 'Percent Change (%)', [-30,20])

fig.tight_layout(pad=0.5, rect=[0, 0, 0.9, 0.9])
plt.suptitle('2018 vs 2019, Median Price of Flats', fontsize=16)
plt.show()

The changes from 2018 to 2019 are not very large, although prices in Toa Payoh have dropped, and Central Area 4-room flats have dropped by 20%. Could it be because these areas have older HDB blocks, meaning their leases are now shorter? As shown below, places like Punggol and Sengkang tend to have later lease commencement dates, as they were developed later, which might explain their slight price increases, while places like Toa Payoh and the Central Area tend to have earlier lease commencement dates.

In [85]:
prices[prices['year'].isin([2018,2019])].groupby('town')['lease_commence_date'].median().sort_values()
Out[85]:
town
MARINE PARADE      1975
ANG MO KIO         1980
BEDOK              1980
CLEMENTI           1980
KALLANG/WHAMPOA    1982
GEYLANG            1982
TOA PAYOH          1984
JURONG EAST        1984
CENTRAL AREA       1984
SERANGOON          1986
BUKIT MERAH        1986
BUKIT BATOK        1986
YISHUN             1988
BUKIT TIMAH        1988
BISHAN             1988
TAMPINES           1988
HOUGANG            1989
PASIR RIS          1993
QUEENSTOWN         1995
CHOA CHU KANG      1996
JURONG WEST        1997
WOODLANDS          1997
BUKIT PANJANG      1999
SEMBAWANG          2001
SENGKANG           2006
PUNGGOL            2013
Name: lease_commence_date, dtype: int64

3.3. By Storeys

In [29]:
fig = plt.figure(figsize=(12,4))

# Storey Prices
ax1 = plt.subplot(121)
storey = prices.groupby('storey_range')['real_price'].median().reset_index().sort_values(by='storey_range')
storey['storey_rank'] = storey['storey_range'].astype('category').cat.codes # label encode
a=sns.scatterplot(x=storey['storey_rank'], y=storey['real_price'], s=storey['storey_rank'].astype('int')*30, color='#00994d', edgecolors='w', alpha=0.5, ax=ax1)
ylabels = ['{:,.0f}'.format(x) + 'K' for x in a.get_yticks()/1000]
ax1.set_yticklabels(ylabels)
ax1.set_xticklabels(pd.Series(['']).append(storey.iloc[[0,5,10,15,20,24],0]))
ax1.set_ylim([280000,1100000]), ax1.set_ylabel('Resale Price SGD ($)', size=15), ax1.set_xlabel('Storey', size=15)
ax1.set_title('All Years', size=15)

# Floor Area Prices
ax2 = plt.subplot(122)
storey2 = prices[prices['year'].isin([2015,2016,2017,2018,2019])].groupby('storey_range')['real_price'].median().reset_index().sort_values(by='storey_range')
storey2['storey_rank'] = storey2['storey_range'].astype('category').cat.codes

b=sns.scatterplot(x=storey2['storey_rank'], y=storey2['real_price'], s=storey2['storey_rank'].astype('int')*30, color='#00994d', edgecolors='w', alpha=0.5, ax=ax2)
ylabels = ['{:,.0f}'.format(x) + 'K' for x in ax2.get_yticks()/1000]
ax2.set_yticklabels(ylabels); ax2.set_ylabel('')
ax2.set_xticks([0,4,8,12,16])
ax2.set_xticklabels(storey2.iloc[[0,4,8,12,16],0])
ax2.set_ylim([280000,1100000]), ax2.set_xlabel('Storey', size=15)
ax2.set_title('2015 to 2019', size=15)

plt.show()

Surprisingly, the relationship is roughly linear, with higher storeys selling at higher prices.

3.4. By Floor Area

In [30]:
# Floor Area Prices
area = prices[prices['year'].isin([2015,2016,2017,2018,2019])]
p=sns.regplot(x='floor_area_sqm', y='real_price', data=area, scatter_kws={"s": 1, 'alpha':0.5})
ylabels = ['{:,.0f}'.format(x) + 'K' for x in p.get_yticks()/1000]
p.set_yticklabels(ylabels)
p.set_ylabel('Resale Price SGD ($)', size=15)
p.set_xlabel('Floor Area (Square Meters)', size=15)
plt.close()

In [31]:
display(area[area['floor_area_sqm'] > 200])
month town flat_type block street_name storey_range floor_area_sqm flat_model lease_commence_date resale_price remaining_lease cpi real_price year
712167 2015-03-01 KALLANG/WHAMPOA 3 ROOM 53 JLN MA'MOR 01 TO 03 280.0 Terrace 1972 1060000.0 56.000000 111.472 9.509114e+05 2015
716959 2015-06-01 KALLANG/WHAMPOA 3 ROOM 60 JLN BAHAGIA 01 TO 03 241.0 Terrace 1972 958000.0 56.000000 109.812 8.724001e+05 2015
745544 2016-12-01 KALLANG/WHAMPOA 3 ROOM 57 JLN MA'MOR 01 TO 03 259.0 Terrace 1972 1150000.0 54.000000 103.855 1.107313e+06 2016
755075 2017-06-01 KALLANG/WHAMPOA 3 ROOM 38 JLN BAHAGIA 01 TO 03 215.0 Terrace 1972 830000.0 54.083333 103.097 8.050671e+05 2017
760156 2017-09-01 CHOA CHU KANG EXECUTIVE 641 CHOA CHU KANG ST 64 16 TO 18 215.0 Executive Maisonette 1998 888000.0 79.333333 102.393 8.672468e+05 2017
765900 2017-12-01 KALLANG/WHAMPOA 3 ROOM 65 JLN MA'MOR 01 TO 03 249.0 Terrace 1972 1053888.0 53.583333 101.479 1.038528e+06 2017
767035 2018-01-01 CHOA CHU KANG EXECUTIVE 639 CHOA CHU KANG ST 64 10 TO 12 215.0 Executive Maisonette 1998 900000.0 79.000000 100.274 8.975407e+05 2018
781983 2018-09-01 KALLANG/WHAMPOA 3 ROOM 41 JLN BAHAGIA 01 TO 03 237.0 Terrace 1972 1185000.0 52.833333 101.889 1.163030e+06 2018

The cases at the top right of the chart consist of flats that are either Terrace or Executive Maisonette models, which is not surprising.

3.5. By Block Number

A 3-digit block numbering system was introduced in the 1970s, with the 1st digit representing a neighbourhood within a town. For example, AMK neighbourhood 1 blocks start at 101, and AMK neighbourhood 2 blocks start at 201. The first digit was therefore separated from the last 2 digits and the two were plotted separately.

In [32]:
import re

# Block Number Prices
get_num = lambda x: int(re.findall(r"\d+", x)[0])
prices['blocknum'] = prices['block'].apply(get_num) # get only digits from block number
tmp = prices[prices['blocknum'] > 99] # get only blocks that use 3-digit numbering system
tmp = tmp.groupby('blocknum')['real_price'].median().reset_index()

fig = plt.figure(figsize=(12,4))

ax1 = plt.subplot(121)
a=sns.scatterplot(x=tmp['blocknum'].apply(lambda x: int(str(x)[0])), y=tmp['real_price'], color='#ff9933', edgecolors='w', alpha=0.9)
ylabels = ['{:,.0f}'.format(x) + 'K' for x in a.get_yticks()/1000]
ax1.set_yticklabels(ylabels)
ax1.set_ylabel('Resale Price SGD ($)', size=15), ax1.set_xlabel('Neighbourhood Number', size=15)

ax2 = plt.subplot(122)
b=sns.scatterplot(x=tmp['blocknum'].apply(lambda x: int(str(x)[1:])), y=tmp['real_price'], edgecolors='w', alpha=0.9)
ax2.set_yticklabels(ylabels)
ax2.set_ylabel('', size=15)
ax2.set_xlabel('Block Number', size=15)

plt.show()

3.6. By Flat Model

In [33]:
fig = plt.figure(figsize=(12,7))
p=sns.violinplot(x='flat_model', y='real_price', data=prices, width=1,
                order=prices.groupby('flat_model')['real_price'].median().sort_values().reset_index()['flat_model'].tolist())
p.set_xticklabels(p.get_xticklabels(), rotation=30, ha='right'), p.set_xlabel('Flat Models', size=15)
ylabels = ['{:,.0f}'.format(x) + 'K' for x in p.get_yticks()/1000]
p.set_yticklabels(ylabels)
p.set_ylabel('Resale Price SGD ($)', size=15)
plt.show()

The special models like Type S1S2 (The Pinnacle@Duxton) and Terrace tend to fetch higher prices, while the older models tend to go for less.

In [34]:
# ridgeline plot

# tmp = prices.set_index('flat_model')
# tmp = tmp.loc[prices.groupby('flat_model')['real_price'].median().sort_values().reset_index()['flat_model'].tolist()].reset_index().groupby("flat_model", sort=False)
# fig, axes = joypy.joyplot(tmp, by="flat_model", column="real_price",figsize=(12,5),
#              linewidth=0.05,overlap=1.5,alpha=0.8,colormap=plt.cm.get_cmap('tab20',16))
# axes[-1].set_xlim([-50000,1400000])
# axes[-1].set_xticklabels(['0', '200k', '400k', '600k', '800k', '1000k', '1200k', '1400k']) 
# plt.xlabel('Resale Price SGD ($)', fontsize=14)
# fig.show()

3.7. By Lease Commence Date

In [35]:
fig = plt.figure(figsize=(7,9))
p=sns.boxplot(y='lease_commence_date', x='real_price', data=prices, width=1, orient='h', flierprops = dict(markerfacecolor = 'red', markersize = 0.1, linestyle='none'), linewidth=0.4)
p.set_xlabel('Resale Price SGD ($)', size=15), p.set_ylabel('Lease Commence Year', size=15)
xlabels = ['{:,.0f}'.format(x) + 'K' for x in p.get_xticks()/1000]
p.set_xticklabels(xlabels)
p.set_title('Resale Price By Lease Commence Year', size=15)
plt.show()
In [36]:
tmp = prices[prices['year'].isin([2015,2016,2017,2018,2019])]
fig, axes = joypy.joyplot(tmp, by="lease_commence_date", column="real_price",figsize=(6,10),
             linewidth=1,overlap=5,alpha=0.8,colormap=plt.cm.get_cmap('tab20',16))
axes[-1].set_xlim([-50000,1400000])
axes[-1].set_xticklabels(['0', '200k', '400k', '600k', '800k', '1000k', '1200k', '1400k']) 
plt.xlabel('Resale Price SGD ($)', fontsize=14)
fig.show()

3.8. By Distance to Nearest Amenities

The names of schools, supermarkets, hawker centres, shopping malls, parks and MRT stations were downloaded/scraped from Data.gov.sg and Wikipedia and fed through a function that uses the OneMap.sg API to get their coordinates (latitude and longitude). These coordinates were then fed through other functions that use the geopy package to compute distances between locations. From these, the distance from each flat to the nearest amenity of each type can be computed, as well as the number of each amenity within a 2 km radius of each flat.

Some Kallang/Whampoa flats have no distance values due to search errors on OneMap.sg.
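The distance step described above can be sketched without the scraped data. The notebook uses geopy; a plain haversine stands in here, and the coordinates below are illustrative, not the actual amenity data:

```python
from math import radians, sin, cos, asin, sqrt

# Sketch of the nearest-amenity computation. geopy is used in the actual
# pipeline; this is a stand-in haversine with hypothetical coordinates.
def haversine_km(a, b):
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))  # Earth radius ~6371 km

flat = (1.3700, 103.8495)  # a hypothetical flat location
mrts = {'Ang Mo Kio': (1.3699, 103.8496), 'Bishan': (1.3513, 103.8492)}
dists = {name: haversine_km(flat, xy) for name, xy in mrts.items()}
nearest = min(dists, key=dists.get)                 # name of the nearest MRT
n_within_2km = sum(d <= 2 for d in dists.values())  # count within 2 km radius
```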

In [37]:
flat_amenities = pd.read_csv('Data/flat_amenities.csv')

prices1519 = prices[prices['year'].isin([2015,2016,2017,2018,2019])].copy()
prices1519['flat'] = prices1519['block'] + ' ' + prices1519['street_name']
prices1519 = prices1519.merge(flat_amenities, on='flat', how='left')
d_region = {'ANG MO KIO':'North East', 'BEDOK':'East', 'BISHAN':'Central', 'BUKIT BATOK':'West', 'BUKIT MERAH':'Central',
       'BUKIT PANJANG':'West', 'BUKIT TIMAH':'Central', 'CENTRAL AREA':'Central', 'CHOA CHU KANG':'West',
       'CLEMENTI':'West', 'GEYLANG':'Central', 'HOUGANG':'North East', 'JURONG EAST':'West', 'JURONG WEST':'West',
       'KALLANG/WHAMPOA':'Central', 'MARINE PARADE':'Central', 'PASIR RIS':'East', 'PUNGGOL':'North East',
       'QUEENSTOWN':'Central', 'SEMBAWANG':'North', 'SENGKANG':'North East', 'SERANGOON':'North East', 'TAMPINES':'East',
       'TOA PAYOH':'Central', 'WOODLANDS':'North', 'YISHUN':'North'}
prices1519['region'] = prices1519['town'].map(d_region)
colors = {'North East':'Purple', 'East':'Green', 'Central':'Brown', 'West':'Red', 'North':'Orange'}
In [38]:
tmp = prices1519.groupby('town')[['dist_dhoby','school_dist','num_school_2km','hawker_dist','num_hawker_2km','park_dist','num_park_2km','mall_dist','num_mall_2km','mrt_dist','num_mrt_2km','supermarket_dist','num_supermarket_2km','real_price']].median().reset_index()
tmp['region'] = tmp['town'].map(d_region)

fig, ax = plt.subplots(figsize=(8,5))
grouped = tmp.groupby('region')
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='dist_dhoby', y='real_price', label=key, color=colors[key], s=60)
b, a = np.polyfit(tmp['dist_dhoby'], tmp['real_price'], 1)
ax.plot(tmp['dist_dhoby'], a + b* tmp['dist_dhoby'], '-')  
ax.set_xlim([0,20]), ax.set_xlabel('Distance from Dhoby Ghaut MRT (km)', size=15)
ylabels = ['{:,.0f}'.format(x) + 'K' for x in ax.get_yticks()/1000]
ax.set_yticklabels(ylabels), ax.set_ylabel('Resale Price SGD ($)', size=15)
for i, txt in enumerate(tmp['town']):
    ax.annotate(txt, (tmp['dist_dhoby'][i]+0.3, tmp['real_price'][i]), size=8, alpha=0.9)

plt.show()

The relationship is negative: flats further away from Dhoby Ghaut MRT (Central) have lower resale prices.

In [39]:
p=sns.pairplot(tmp, x_vars=["school_dist", "hawker_dist", "park_dist", "mall_dist", "mrt_dist", "supermarket_dist"], y_vars=["real_price"], height=3, aspect=1, kind="reg", plot_kws=dict(scatter_kws=dict(s=40)))
axes=p.axes
ylabels = ['{:,.0f}'.format(x) + 'K' for x in axes[0,0].get_yticks()/1000]
axes[0,0].set_yticklabels(ylabels), axes[0,0].set_ylabel('Resale Price SGD ($)', size=10)
axes[0,0].set_xlabel('Distance From School (km)', size=10), axes[0,1].set_xlabel('Distance From Hawker (km)', size=10)
axes[0,2].set_xlabel('Distance From Park (km)', size=10), axes[0,3].set_xlabel('Distance From Mall (km)', size=10)
axes[0,4].set_xlabel('Distance From MRT (km)', size=10), axes[0,5].set_xlabel('Distance From Supermarket (km)', size=10)
plt.suptitle('Resale Price (Median of Each Town) VS Distance from Nearest Amenities (Median of Each Town)')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.show()

Marine Parade is far from the nearest MRT station.

In [40]:
p=sns.pairplot(prices1519[prices1519['school_dist']<3], x_vars=["school_dist", "hawker_dist", "park_dist", "mall_dist", "mrt_dist", "supermarket_dist"], y_vars=["real_price"], height=3, aspect=1, kind="reg", plot_kws=dict(scatter_kws=dict(s=0.5,alpha=0.3), line_kws=dict(color='#ff7f0e'))) # remove outliers (>3km)
axes=p.axes
ylabels = ['{:,.0f}'.format(x) + 'K' for x in axes[0,0].get_yticks()/1000]
axes[0,0].set_yticklabels(ylabels), axes[0,0].set_ylabel('Resale Price SGD ($)', size=10)
axes[0,0].set_xlabel('Distance From School (km)', size=10), axes[0,1].set_xlabel('Distance From Hawker (km)', size=10)
axes[0,2].set_xlabel('Distance From Park (km)', size=10), axes[0,3].set_xlabel('Distance From Mall (km)', size=10)
axes[0,4].set_xlabel('Distance From MRT (km)', size=10), axes[0,5].set_xlabel('Distance From Supermarket (km)', size=10)
plt.suptitle('Resale Price VS Distance from Nearest Amenities')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.close()

Surprisingly, the relationship with school distance is not as strong as expected, while the relationship with distance from the nearest hawker centre is more pronounced than in the median plots.

3.9. By Number of Amenities in 2km Radius

In [41]:
p=sns.pairplot(tmp, x_vars=["num_school_2km", "num_hawker_2km", "num_park_2km", "num_mall_2km", "num_mrt_2km", "num_supermarket_2km"], y_vars=["real_price"], height=3, aspect=1, kind="reg", plot_kws=dict(scatter_kws=dict(s=40)))
axes=p.axes
ylabels = ['{:,.0f}'.format(x) + 'K' for x in axes[0,0].get_yticks()/1000]
axes[0,0].set_yticklabels(ylabels), axes[0,0].set_ylabel('Resale Price SGD ($)', size=10)
axes[0,0].set_xlabel('Number of Schools', size=10), axes[0,1].set_xlabel('Number of Hawkers', size=10)
axes[0,2].set_xlabel('Number of Parks', size=10), axes[0,3].set_xlabel('Number of Malls', size=10)
axes[0,4].set_xlabel('Number of MRTs', size=10), axes[0,5].set_xlabel('Number of Supermarkets', size=10)
plt.suptitle('Resale Price (Median of Each Town) VS Number of Amenities in 2km Radius (Median of Each Town)')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.show()
In [42]:
p=sns.pairplot(prices1519, x_vars=["num_school_2km", "num_hawker_2km", "num_park_2km", "num_mall_2km", "num_mrt_2km", "num_supermarket_2km"], y_vars=["real_price"], height=3, aspect=1, kind="reg", plot_kws=dict(scatter_kws=dict(s=0.5,alpha=0.3), line_kws=dict(color='#ff7f0e')))
axes=p.axes
ylabels = ['{:,.0f}'.format(x) + 'K' for x in axes[0,0].get_yticks()/1000]
axes[0,0].set_yticklabels(ylabels), axes[0,0].set_ylabel('Resale Price SGD ($)', size=10)
axes[0,0].set_xlabel('Number of Schools', size=10), axes[0,1].set_xlabel('Number of Hawkers', size=10)
axes[0,2].set_xlabel('Number of Parks', size=10), axes[0,3].set_xlabel('Number of Malls', size=10)
axes[0,4].set_xlabel('Number of MRTs', size=10), axes[0,5].set_xlabel('Number of Supermarkets', size=10)
plt.suptitle('Resale Price VS Number of Amenities in 2km Radius')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.close()

Punggol has its own LRT, and the Central Area has many MRT stations and malls nearby.

4. Data Preparation

Focus only on data from 2015 to 2019.

In [43]:
# clear unused variables
del(price1999, price2012, price2014, price2016, price2017, prices1819, prices1819_4room, prices9719, prices9719_4room,
    storey, storey2, tmp, xlabels, ylabels, p, grouped, flattype2019, flattype, flat_amenities, cpi, ax, ax1, ax2, area)
import gc
gc.collect()
Out[43]:
235467

4.1. Missing Values

Replace missing distance values with the median of the town. Only Kallang/Whampoa has missing data, so the function below replaces those values with the medians of the corresponding Kallang/Whampoa distance variables.

In [44]:
df = prices1519[['town', 'flat_type', 'storey_range', 'floor_area_sqm', 'flat_model', 'lease_commence_date', 'year', 'school_dist', 'num_school_2km', 'hawker_dist', 'num_hawker_2km', 'park_dist', 'num_park_2km', 'mall_dist', 'num_mall_2km', 'mrt_dist', 'num_mrt_2km', 'supermarket_dist', 'num_supermarket_2km', 'dist_dhoby', 'region', 'real_price']]

def replace_NA_median(df, columns):
    # fill NaNs in each column with the median of that column within the same town
    for c in columns:
        df[c] = df.groupby("town")[c].transform(lambda x: x.fillna(x.median()))
    return df

df = replace_NA_median(df, ['school_dist', 'num_school_2km', 'hawker_dist',
       'num_hawker_2km', 'park_dist', 'num_park_2km', 'mall_dist',
       'num_mall_2km', 'mrt_dist', 'num_mrt_2km', 'supermarket_dist',
       'num_supermarket_2km', 'dist_dhoby'])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 101409 entries, 0 to 101408
Data columns (total 22 columns):
town                   101409 non-null object
flat_type              101409 non-null object
storey_range           101409 non-null object
floor_area_sqm         101409 non-null float64
flat_model             101409 non-null object
lease_commence_date    101409 non-null int64
year                   101409 non-null int64
school_dist            101409 non-null float64
num_school_2km         101409 non-null float64
hawker_dist            101409 non-null float64
num_hawker_2km         101409 non-null float64
park_dist              101409 non-null float64
num_park_2km           101409 non-null float64
mall_dist              101409 non-null float64
num_mall_2km           101409 non-null float64
mrt_dist               101409 non-null float64
num_mrt_2km            101409 non-null float64
supermarket_dist       101409 non-null float64
num_supermarket_2km    101409 non-null float64
dist_dhoby             101409 non-null float64
region                 101409 non-null object
real_price             101409 non-null float64
dtypes: float64(15), int64(2), object(5)
memory usage: 17.8+ MB

4.2. Multicollinearity

In [45]:
# Correlation heatmap
fig = plt.figure(figsize=(10,10))
ax = sns.heatmap(df.select_dtypes(include=['int64','float64']).corr(), annot = True, fmt='.2g', 
    vmin=-1, vmax=1, center= 0, cmap= 'coolwarm_r', linecolor='black', linewidth=1, annot_kws={"size": 7})
#ax.set_ylim(0 ,5)
plt.xticks(rotation=45, ha='right')
plt.title('Correlation Heatmap')
fig.show()
In [46]:
# Multicollinearity
# Import library for VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):
    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif['tolerance'] = 1/vif.VIF
    vif['meanVIF'] = vif.VIF.mean()

    return(vif)

calc_vif(df.drop('real_price',axis=1).select_dtypes(include=['int64','float64']))
Out[46]:
variables VIF tolerance meanVIF
0 floor_area_sqm 19.894200 0.050266 5548.765039
1 lease_commence_date 44562.410995 0.000022 5548.765039
2 year 44064.404189 0.000023 5548.765039
3 school_dist 4.706948 0.212452 5548.765039
4 num_school_2km 15.415693 0.064869 5548.765039
5 hawker_dist 10.903288 0.091715 5548.765039
6 num_hawker_2km 5.347933 0.186988 5548.765039
7 park_dist 6.314078 0.158376 5548.765039
8 num_park_2km 5.175563 0.193216 5548.765039
9 mall_dist 6.065183 0.164875 5548.765039
10 num_mall_2km 6.806727 0.146913 5548.765039
11 mrt_dist 5.348409 0.186972 5548.765039
12 num_mrt_2km 10.385980 0.096284 5548.765039
13 supermarket_dist 4.711751 0.212235 5548.765039
14 num_supermarket_2km 27.097202 0.036904 5548.765039
15 dist_dhoby 25.252494 0.039600 5548.765039
In [47]:
calc_vif(df.drop(['real_price','num_supermarket_2km','year','num_school_2km','dist_dhoby'],axis=1).select_dtypes(include=['int64','float64']))
Out[47]:
variables VIF tolerance meanVIF
0 floor_area_sqm 19.454717 0.051401 10.562684
1 lease_commence_date 53.075649 0.018841 10.562684
2 school_dist 4.350898 0.229838 10.562684
3 hawker_dist 8.311330 0.120318 10.562684
4 num_hawker_2km 4.163071 0.240207 10.562684
5 park_dist 6.172912 0.161998 10.562684
6 num_park_2km 4.309959 0.232021 10.562684
7 mall_dist 5.474837 0.182654 10.562684
8 num_mall_2km 5.810105 0.172114 10.562684
9 mrt_dist 5.312813 0.188224 10.562684
10 num_mrt_2km 5.916204 0.169027 10.562684
11 supermarket_dist 4.399711 0.227288 10.562684
In [48]:
# drop columns
lr_df = df.drop(['num_supermarket_2km','year','num_school_2km','dist_dhoby'], axis=1)

4.3. Normality

In [49]:
# Plot distribution for each continuous variable
lr_df.hist(bins=50, figsize=(15, 10), grid=False, edgecolor='black')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.show()

Not all the variables follow a normal distribution, and most of the distance variables have some outliers. For these outliers, it is better to use a multivariate approach such as Mahalanobis distance to identify them, rather than a univariate approach. If needed, some of these variables can be transformed to reduce their skewness.
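As a minimal sketch of the Mahalanobis-distance idea (toy data standing in for the distance features, not the notebook's variables): each row's squared Mahalanobis distance from the centroid is compared against a chi-square cutoff, which flags points that are jointly unusual even if no single coordinate is extreme.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))   # toy data standing in for three distance features
X[0] = [8, 8, 8]                # plant one obvious multivariate outlier

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
# squared Mahalanobis distance of each row from the sample centroid
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

# under multivariate normality, d2 follows a chi-square with p degrees of freedom
threshold = stats.chi2.ppf(0.999, df=X.shape[1])
outliers = np.where(d2 > threshold)[0]
print(outliers)
```

The planted row is flagged because its squared distance far exceeds the 99.9% chi-square quantile.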

In [50]:
from statsmodels.api import qqplot
fig, ((ax1,ax2), (ax3,ax4)) = plt.subplots(2,2,figsize=(5,5))

ax1.hist(lr_df['real_price'], bins=50, edgecolor='black')
qqplot(lr_df['real_price'], line='s', ax=ax2)
ax3.hist(np.log(lr_df['real_price']), bins=50, edgecolor='black')
qqplot(np.log(lr_df['real_price']), line='s', ax=ax4)
plt.suptitle('Real Price Normality Before (Top) & After (Bottom) Logging')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.close()

We will fit the linear regression first and come back to check on the normality of the residuals as well as homoscedasticity.

4.4. Label & Dummy Encoding

In [51]:
# Frequency plots for categorical features
fig = plt.figure(figsize=(30,5))
for count, col in enumerate(df.select_dtypes(include=['category','object']).columns):
    fig.add_subplot(1,5,count+1)
    df[col].value_counts().plot.barh()
    plt.title(col)
    plt.yticks(rotation=45)
    plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])

Region might be a better choice than town, since town has many classes and one-hot encoding it would lead to a very sparse matrix. For flat_type, we can remove Multi Generation and 1 Room since there are few instances of them. storey_range will be label encoded according to its levels, while flat_model should be further grouped to reduce the number of classes.

In [52]:
# label encode storeys
df = df.sort_values(by='storey_range')
df['storey_range'] = df['storey_range'].astype('category').cat.codes # label encode
lr_df = lr_df.sort_values(by='storey_range')
lr_df['storey_range'] = lr_df['storey_range'].astype('category').cat.codes # label encode

# remove flat types with very few cases
df = df[~df['flat_type'].isin(['MULTI GENERATION', '1 ROOM'])]
lr_df = lr_df[~lr_df['flat_type'].isin(['MULTI GENERATION', '1 ROOM'])]

# Re-categorize flat model to reduce num classes
replace_values = {'Executive Maisonette':'Maisonette', 'Terrace':'Special', 'Adjoined flat':'Special', 
                    'Type S1S2':'Special', 'DBSS':'Special', 'Model A2':'Model A', 'Premium Apartment':'Apartment', 'Improved':'Standard', 'Simplified':'Model A', '2-room':'Standard'}
df = df.replace({'flat_model': replace_values})
lr_df = lr_df.replace({'flat_model': replace_values})

# Label encode flat type
replace_values = {'2 ROOM':0, '3 ROOM':1, '4 ROOM':2, '5 ROOM':3, 'EXECUTIVE':4}
df = df.replace({'flat_type': replace_values})
lr_df = lr_df.replace({'flat_type': replace_values})

df = df.reset_index(drop=True)
display(df['flat_model'].value_counts())
lr_df = lr_df.reset_index(drop=True)
display(lr_df['flat_model'].value_counts())
Model A           37973
Standard          28539
New Generation    15389
Apartment         14532
Maisonette         3151
Special            1744
Name: flat_model, dtype: int64
Model A           37973
Standard          28539
New Generation    15389
Apartment         14532
Maisonette         3151
Special            1744
Name: flat_model, dtype: int64
In [53]:
display(lr_df.head())
town flat_type storey_range floor_area_sqm flat_model lease_commence_date school_dist hawker_dist num_hawker_2km park_dist num_park_2km mall_dist num_mall_2km mrt_dist num_mrt_2km supermarket_dist region real_price
0 TAMPINES 2 0 108.0 Model A 1985 0.359916 1.015992 1.0 0.591787 8.0 0.222197 6.0 0.265708 5.0 0.197700 East 470763.371323
1 BUKIT BATOK 2 0 103.0 Model A 1986 0.428729 2.476292 0.0 0.976805 10.0 1.395023 4.0 0.542323 6.0 0.334088 West 368625.028643
2 JURONG WEST 1 0 74.0 Model A 1984 0.174088 0.441678 4.0 0.725144 4.0 0.907439 3.0 0.229473 3.0 0.467091 West 253396.280952
3 JURONG WEST 1 0 74.0 Model A 1986 0.409651 0.095477 3.0 1.147330 4.0 0.772041 3.0 0.753994 3.0 0.194830 West 263833.992095
4 JURONG WEST 1 0 74.0 Model A 1985 0.437646 0.522134 4.0 1.329356 3.0 1.276048 2.0 0.806693 2.0 0.301456 West 261857.707510
In [54]:
## dummy encoding
df = pd.get_dummies(df, columns=['region'], prefix=['region'], drop_first=True) # central is baseline
df = pd.get_dummies(df, columns=['flat_model'], prefix=['model'])
df= df.drop('model_Standard',axis=1) # remove standard, setting it as the baseline
lr_df = pd.get_dummies(lr_df, columns=['region'], prefix=['region'], drop_first=True) # central is baseline
lr_df = pd.get_dummies(lr_df, columns=['flat_model'], prefix=['model'])
lr_df= lr_df.drop('model_Standard',axis=1) # remove standard, setting it as the baseline

4.5. Feature Scaling

Scaling is done only for linear regression. Tree-based models do not require scaling, since feature scale does not affect how trees split.
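A quick sanity check of that claim on toy data (hypothetical features, not the notebook's): a decision tree fit on raw features and on a rescaled copy produces identical predictions, since splits depend only on the ordering of feature values, not their scale.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

# same data with one feature blown up by a large constant
X_scaled = X.copy()
X_scaled[:, 0] *= 1000

tree_raw = DecisionTreeRegressor(random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(random_state=0).fit(X_scaled, y)

# identical predictions: rescaling shifts the thresholds but not the partitions
print(np.allclose(tree_raw.predict(X), tree_scaled.predict(X_scaled)))
```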

In [55]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# fit to continuous columns and transform
scaled_columns = ['floor_area_sqm','lease_commence_date','school_dist','hawker_dist','num_hawker_2km','park_dist',
                    'num_park_2km', 'mall_dist', 'num_mall_2km', 'mrt_dist', 'num_mrt_2km', 'supermarket_dist']
scaler.fit(lr_df[scaled_columns])
scaled_columns = pd.DataFrame(scaler.transform(lr_df[scaled_columns]), index=lr_df.index, columns=scaled_columns)

# separate unscaled features
unscaled_columns = lr_df.drop(scaled_columns, axis=1)

# concatenate scaled and unscaled features
lr_df = pd.concat([scaled_columns,unscaled_columns], axis=1)
In [56]:
display(lr_df.head())
floor_area_sqm lease_commence_date school_dist hawker_dist num_hawker_2km park_dist num_park_2km mall_dist num_mall_2km mrt_dist ... real_price region_East region_North region_North East region_West model_Apartment model_Maisonette model_Model A model_New Generation model_Special
0 0.434709 -0.637010 -0.173252 -0.584115 -0.481922 -0.483176 -0.183114 -1.174579 0.329413 -0.920217 ... 470763.371323 1 0 0 0 0 0 1 0 0
1 0.227901 -0.553487 0.112022 0.343105 -0.838824 0.416079 0.104527 1.881664 -0.309844 -0.199828 ... 368625.028643 0 0 0 1 0 0 1 0 0
2 -0.971581 -0.720534 -0.943631 -0.948777 0.588787 -0.171706 -0.758398 0.611078 -0.629472 -1.014584 ... 253396.280952 0 0 0 1 0 0 1 0 0
3 -0.971581 -0.553487 0.032932 -1.168598 0.231884 0.814358 -0.758398 0.258249 -0.629472 0.351427 ... 263833.992095 0 0 0 1 0 0 1 0 0
4 -0.971581 -0.637010 0.148990 -0.897691 0.588787 1.239501 -0.902218 1.571630 -0.949100 0.488672 ... 261857.707510 0 0 0 1 0 0 1 0 0

5 rows × 25 columns

5. Linear Regression

5.1. Model Building & Results

In [57]:
from sklearn.linear_model import LinearRegression

lr_y = lr_df[['real_price']]
lr_X = lr_df.drop(['real_price','town'], axis=1)

lin_reg = LinearRegression()
lin_reg.fit(lr_X, np.log(lr_y))

print(f'Coefficients: {lin_reg.coef_}')
print(f'Intercept: {lin_reg.intercept_}')
print(f'R^2 score: {lin_reg.score(lr_X, np.log(lr_y))}')
Coefficients: [[ 1.90370946e-01  1.14764184e-01  8.40369814e-03 -7.04347510e-02
   1.14008493e-02 -1.09128413e-02  3.32159293e-02 -1.02476645e-02
  -2.55313678e-03 -3.31313117e-02 -4.46424168e-05 -1.14179735e-02
   5.64211009e-02  2.49717537e-02 -1.79450080e-01 -3.40152643e-01
  -2.17444228e-01 -2.45381074e-01  1.65749830e-02  5.19352410e-02
   3.07136822e-02  6.35880849e-02  1.73230488e-01]]
Intercept: [12.90390102]
R^2 score: 0.8649492205148165
In [58]:
import statsmodels.api as sm
# alternate way using statistical formula, which does not require dummy coding manually
# https://stackoverflow.com/questions/50733014/linear-regression-with-dummy-categorical-variables
# https://stackoverflow.com/questions/34007308/linear-regression-analysis-with-string-categorical-features-variables

X_constant = sm.add_constant(lr_X)
lin_reg = sm.OLS(np.log(lr_y),X_constant).fit()
lin_reg.summary()
Out[58]:
OLS Regression Results
Dep. Variable: real_price R-squared: 0.865
Model: OLS Adj. R-squared: 0.865
Method: Least Squares F-statistic: 2.821e+04
Date: Mon, 09 Nov 2020 Prob (F-statistic): 0.00
Time: 16:34:36 Log-Likelihood: 75064.
No. Observations: 101328 AIC: -1.501e+05
Df Residuals: 101304 BIC: -1.499e+05
Df Model: 23
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 12.9039 0.004 3600.429 0.000 12.897 12.911
floor_area_sqm 0.1904 0.001 149.087 0.000 0.188 0.193
lease_commence_date 0.1148 0.001 201.851 0.000 0.114 0.116
school_dist 0.0084 0.000 21.071 0.000 0.008 0.009
hawker_dist -0.0704 0.001 -97.203 0.000 -0.072 -0.069
num_hawker_2km 0.0114 0.001 18.012 0.000 0.010 0.013
park_dist -0.0109 0.000 -25.370 0.000 -0.012 -0.010
num_park_2km 0.0332 0.001 61.818 0.000 0.032 0.034
mall_dist -0.0102 0.000 -23.293 0.000 -0.011 -0.009
num_mall_2km -0.0026 0.000 -5.219 0.000 -0.004 -0.002
mrt_dist -0.0331 0.000 -73.474 0.000 -0.034 -0.032
num_mrt_2km -4.464e-05 0.001 -0.061 0.952 -0.001 0.001
supermarket_dist -0.0114 0.000 -29.339 0.000 -0.012 -0.011
flat_type 0.0564 0.001 38.336 0.000 0.054 0.059
storey_range 0.0250 0.000 116.520 0.000 0.025 0.025
region_East -0.1795 0.001 -120.932 0.000 -0.182 -0.177
region_North -0.3402 0.002 -197.841 0.000 -0.344 -0.337
region_North East -0.2174 0.002 -135.436 0.000 -0.221 -0.214
region_West -0.2454 0.002 -153.284 0.000 -0.249 -0.242
model_Apartment 0.0166 0.001 12.372 0.000 0.014 0.019
model_Maisonette 0.0519 0.002 21.990 0.000 0.047 0.057
model_Model A 0.0307 0.001 27.575 0.000 0.029 0.033
model_New Generation 0.0636 0.001 45.556 0.000 0.061 0.066
model_Special 0.1732 0.003 56.716 0.000 0.167 0.179
Omnibus: 4827.972 Durbin-Watson: 1.775
Prob(Omnibus): 0.000 Jarque-Bera (JB): 8780.373
Skew: 0.376 Prob(JB): 0.00
Kurtosis: 4.231 Cond. No. 41.3


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [59]:
ax = sns.scatterplot(x=np.log(lr_y)['real_price'], y=lin_reg.predict(), edgecolors='w', alpha=0.9, s=8)
ax.set_xlabel('Observed')#, ax.set_xticklabels(['{:,.0f}'.format(x) + 'K' for x in ax.get_xticks()/1000])
ax.set_ylabel('Predicted')#, ax.set_yticklabels(['{:,.0f}'.format(x) + 'K' for x in ax.get_yticks()/1000])
ax.annotate('Adjusted R\u00b2: ' + str(round(lin_reg.rsquared_adj,2)), xy=(0, 1), xytext=(25, -25),
    xycoords='axes fraction', textcoords='offset points', fontsize=12)
plt.close()

5.2. Homoscedasticity and Normality of Residuals

In [60]:
# Homoscedasticity and Normality of Residuals
pred = lin_reg.predict()
resids = lin_reg.resid
resids_studentized = lin_reg.get_influence().resid_studentized_internal

fig = plt.figure(figsize=(10,3))

ax1 = plt.subplot(121)
sns.scatterplot(x=pred, y=resids_studentized, edgecolors='w', alpha=0.9, s=8)
ax1.set_xlabel('Predicted Values')
ax1.set_ylabel('Studentized Residuals')

ax2 = plt.subplot(122)
sns.distplot(resids_studentized, norm_hist=True, hist_kws=dict(edgecolor='w'))
ax2.set_xlabel('Studentized Residual')

plt.close()

5.3. Feature Importance

In [61]:
# Feature Importance

lr_results = pd.read_html(lin_reg.summary().tables[1].as_html(),header=0,index_col=0)[0]
coefs = lr_results[['coef']][1:].reset_index().rename(columns={'index':'feature'})
coefs['feature_importance'] = np.abs(coefs['coef'])
coefs = coefs.sort_values('feature_importance').reset_index(drop=True)
coefs['color'] = coefs['coef'].apply(lambda x: '#66ff8c' if x>0 else '#ff8c66')
coefs.plot.barh(x='feature',y='feature_importance',color=coefs['color'],figsize=(6,7))
colors = {'Positive':'#66ff8c', 'Negative':'#ff8c66'}         
labels = list(colors.keys())
handles = [plt.Rectangle((0,0),1,1, color=colors[label]) for label in labels]
legend = plt.legend(handles, labels, title='Relationship', fontsize = '15')
plt.setp(legend.get_title(),fontsize='17')
plt.xlabel('Standardized Coefficients', size=15), plt.ylabel('Features', size=15)
plt.ylim([-1,23])
plt.title('Linear Regression Feature Importance', size=15)
plt.show()

It seems that region is the feature that drives resale prices the most. The Central region was used as the baseline against which all other regions are compared; the negative coefficients mean flats in every other region tend to sell for less than comparable flats in the Central region. Floor area, lease commencement date and special flat models are also positive drivers of resale prices, while distance to the nearest hawker centre is a negative driver.
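Because the model was fitted on log(price), a dummy coefficient β translates into roughly an (e^β − 1) × 100% price difference relative to the baseline. As a worked example, taking the region_North coefficient of −0.3402 from the OLS summary above:

```python
import numpy as np

beta = -0.3402  # region_North coefficient from the OLS summary
pct_change = (np.exp(beta) - 1) * 100
print(f'North flats sell for about {pct_change:.1f}% less than '
      'comparable Central flats, holding other features fixed')
```

This gives roughly −28.8%, a more interpretable figure than the raw coefficient.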

Let's look at whether a non-linear model will show consistent drivers for resale prices.

6. Random Forest

In [62]:
from sklearn.model_selection import train_test_split

y = df[['real_price']]
X = df.drop(['real_price','town', 'year'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.1, shuffle=True, random_state=0)
print('Shape of X_train:', X_train.shape)
print('Shape of X_test:', X_test.shape)
print('Shape of y_train:', y_train.shape)
print('Shape of y_test:', y_test.shape)
Shape of X_train: (91195, 26)
Shape of X_test: (10133, 26)
Shape of y_train: (91195, 1)
Shape of y_test: (10133, 1)

6.1. Out-Of-Bag

In [63]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error
from scipy.stats import spearmanr, pearsonr

rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X_train, y_train)
predicted_train = rf.predict(X_train)

print(f'Out-of-bag R\u00b2 score estimate: {rf.oob_score_:>5.3}')
Out-of-bag R² score estimate: 0.957
In [64]:
predicted_test = rf.predict(X_test)
oob_test_score = r2_score(y_test['real_price'], predicted_test)
spearman = spearmanr(y_test['real_price'], predicted_test)
pearson = pearsonr(y_test['real_price'], predicted_test)
oob_mae = mean_absolute_error(y_test['real_price'], predicted_test)

print(f'Test data R\u00b2 score: {oob_test_score:>5.3}')
print(f'Test data Spearman correlation: {spearman[0]:.3}')
print(f'Test data Pearson correlation: {pearson[0]:.3}')
print(f'Test data Mean Absolute Error: {round(oob_mae)}')
Test data R² score: 0.958
Test data Spearman correlation: 0.976
Test data Pearson correlation: 0.979
Test data Mean Absolute Error: 21328.0

6.2. K-fold Cross Validation

In [65]:
from sklearn.model_selection import GridSearchCV
param_grid = {
    'max_features': ['auto'], # max number of features considered for splitting a node
    'max_depth': [20], # max number of levels in each decision tree
    'min_samples_split': [15], # min number of data points placed in a node before the node is split
    'min_samples_leaf': [2]} # min number of data points allowed in a leaf node
rfr =GridSearchCV(RandomForestRegressor(n_estimators = 500, n_jobs=-1, random_state=0),
                        param_grid, cv=10, scoring='r2', return_train_score=True)
rfr.fit(X_train,y_train)
print("Best parameters set found on Cross Validation:\n\n", rfr.best_params_)
print("\nCross Validation R\u00b2 score:\n\n", rfr.best_score_.round(3))
Best parameters set found on Cross Validation:

 {'max_depth': 20, 'max_features': 'auto', 'min_samples_leaf': 2, 'min_samples_split': 15}

Cross Validation R² score:

 0.961
In [66]:
cv_predicted_test = rfr.predict(X_test)

cv_test_score = r2_score(y_test['real_price'], cv_predicted_test)
spearman = spearmanr(y_test['real_price'], cv_predicted_test)
pearson = pearsonr(y_test['real_price'], cv_predicted_test)
cv_mae = mean_absolute_error(y_test['real_price'], cv_predicted_test)

print(f'Test data R\u00b2 score: {cv_test_score:>5.3}')
print(f'Test data Spearman correlation: {spearman[0]:.3}')
print(f'Test data Pearson correlation: {pearson[0]:.3}')
print(f'Test data Mean Absolute Error: {round(cv_mae)}')
Test data R² score: 0.962
Test data Spearman correlation: 0.979
Test data Pearson correlation: 0.981
Test data Mean Absolute Error: 20243.0
In [67]:
fig = plt.figure(figsize=(13,4))

ax1 = plt.subplot(121)
ax1 = sns.scatterplot(x=y_test['real_price'], y=predicted_test, edgecolors='w', alpha=0.9, s=8)
ax1.set_xlabel('Observed'), ax1.set_xticklabels(['{:,.0f}'.format(x) + 'K' for x in ax1.get_xticks()/1000])
ax1.set_ylabel('Predicted'), ax1.set_yticklabels(['{:,.0f}'.format(x) + 'K' for x in ax1.get_yticks()/1000])
ax1.annotate('Test R\u00b2: ' + str(round(oob_test_score,3)) + '\nTest MAE: ' + str(round(oob_mae)), xy=(0, 1), xytext=(25, -35),
    xycoords='axes fraction', textcoords='offset points', fontsize=12)
ax1.set_title('Tuned Using Out-Of-Bag')

ax2 = plt.subplot(122)
ax2 = sns.scatterplot(x=y_test['real_price'], y=cv_predicted_test, edgecolors='w', alpha=0.9, s=8)
ax2.set_xlabel('Observed'), ax2.set_xticklabels(['{:,.0f}'.format(x) + 'K' for x in ax2.get_xticks()/1000])
ax2.set_ylabel('Predicted'), ax2.set_yticklabels(['{:,.0f}'.format(x) + 'K' for x in ax2.get_yticks()/1000])
ax2.annotate('Test R\u00b2: ' + str(round(cv_test_score,3)) + '\nTest MAE: ' + str(round(cv_mae)), xy=(0, 1), xytext=(25, -35),
    xycoords='axes fraction', textcoords='offset points', fontsize=12)
ax2.set_title('Tuned Using Cross Validation')
plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
plt.close()

6.3. Feature Importance

In [68]:
fig = plt.figure(figsize=(14,7))

ax1 = plt.subplot(121)
feat_imp = pd.DataFrame({'Features': X_train.columns, 'Feature Importance': rf.feature_importances_}).sort_values('Feature Importance', ascending=False)
sns.barplot(y='Features', x='Feature Importance', data=feat_imp)
#plt.xticks(rotation=45, ha='right')
ax1.set_title('OOB Feature Importance', size=15)

ax2 = plt.subplot(122)
feat_imp = pd.DataFrame({'Features': X_train.columns, 'Feature Importance': rfr.best_estimator_.feature_importances_}).sort_values('Feature Importance', ascending=False)
sns.barplot(y='Features', x='Feature Importance', data=feat_imp)
ax2.set_title('CV Feature Importance', size=15)

plt.tight_layout(pad=0, rect=[0, 0, 0.9, 0.9])
fig.show()

The feature importances are a bit different from the linear regression model. Floor area and lease commencement date are still among the main drivers of resale prices. However, features like distance from Dhoby Ghaut MRT and flat type also appear to be strong drivers.

Tree-based models seem to give lower importance to categorical variables. This is because the importance score measures how often a feature is selected for splitting and how much gain in purity each selection achieves.
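One way to sidestep this impurity-based bias is permutation importance, which scores a feature by how much the model's performance degrades when that feature's values are shuffled. A sketch on toy data (not the notebook's model):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 500
informative = rng.normal(size=n)  # drives the target
noise = rng.normal(size=n)        # unrelated feature
X = np.column_stack([informative, noise])
y = 2 * informative + rng.normal(scale=0.1, size=n)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
# shuffle each column in turn and measure the drop in R^2
result = permutation_importance(rf, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)  # first feature clearly dominates
```

Unlike impurity importance, this is computed on held-out behaviour of the fitted model, so it is less biased towards high-cardinality continuous features.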

6.4. SHAP Values

We can also look at feature importance using SHAP (SHapley Additive exPlanations) values, first proposed by Lundberg and Lee (2017) for interpreting the output of any machine learning model. SHAP values have a few advantages:

  1. Directionality — unlike the feature importance from random forest, SHAP values show both the importance of a feature and the direction in which it influences the outcome
  2. Global interpretability — the collective SHAP values show how much each predictor contributes to the predictions across the entire dataset (not shown here, as it takes a long time for a large dataset)
  3. Local interpretability — each observation gets its own SHAP values, allowing us to identify which features matter most for that particular observation
  4. SHAP values can be computed efficiently for any tree-based model, which many other attribution methods cannot do
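To make the additivity behind these properties concrete, here is an exact Shapley computation for a hypothetical two-feature "model", where v(S) is the model output when only the features in S are known (all numbers are made up for illustration):

```python
from itertools import permutations

# toy "model output" for every subset of two hypothetical features
v = {frozenset(): 10.0,                    # base value (expected prediction)
     frozenset({'area'}): 14.0,            # only floor area known
     frozenset({'region'}): 11.0,          # only region known
     frozenset({'area', 'region'}): 16.0}  # full prediction

features = ['area', 'region']
phi = {f: 0.0 for f in features}
orders = list(permutations(features))
for order in orders:
    seen = set()
    for f in order:
        before = v[frozenset(seen)]
        seen.add(f)
        # average marginal contribution of f over all orderings
        phi[f] += (v[frozenset(seen)] - before) / len(orders)

print(phi)  # {'area': 4.5, 'region': 1.5}
# additivity: the contributions sum to prediction minus base value
print(sum(phi.values()) == v[frozenset(features)] - v[frozenset()])
```

This additivity is exactly what the force plots below visualise: each prediction decomposes into the base value plus one contribution per feature.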

Below shows the contributors to resale price for an example of low, middle and high resale price flats.

Flats with predicted low resale price

In [69]:
import shap
shap.initjs()

explainer = shap.TreeExplainer(rfr.best_estimator_)
shap_values = explainer.shap_values(X_test.iloc[[16]])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[[16]])
Out[69]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
In [70]:
explainer = shap.TreeExplainer(rfr.best_estimator_)
shap_values = explainer.shap_values(X_test.iloc[[5]])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[[5]])
Out[70]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.

Flats with predicted medium resale price

In [71]:
explainer = shap.TreeExplainer(rfr.best_estimator_)
shap_values = explainer.shap_values(X_test.iloc[[1]])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[[1]])
Out[71]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
In [72]:
explainer = shap.TreeExplainer(rfr.best_estimator_)
shap_values = explainer.shap_values(X_test.iloc[[100]])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[[100]])
Out[72]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.

Flats with predicted high resale price

In [73]:
explainer = shap.TreeExplainer(rfr.best_estimator_)
shap_values = explainer.shap_values(X_test.iloc[[172]])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[[172]])
Out[73]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
In [74]:
explainer = shap.TreeExplainer(rfr.best_estimator_)
shap_values = explainer.shap_values(X_test.iloc[[10084]])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[[10084]])
Out[74]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
In [75]:
print("Flat Type Encoding = 2 ROOM:0, 3 ROOM:1, 4 ROOM:2, 5 ROOM:3, EXECUTIVE:4")
Flat Type Encoding = 2 ROOM:0, 3 ROOM:1, 4 ROOM:2, 5 ROOM:3, EXECUTIVE:4